Contents: Applied Statistics Project

  1. Part-A: Solution
  2. Part-B: Solution
  3. Part-C: Solution

Part-A: Solution

1. Refer to the table below to answer the following questions:

Q1.png

The Background:

1A. Refer to the above table and find the joint probability of a person both planning to purchase and actually placing an order.

1B. Refer to the above table and find the conditional probability that a person actually placed an order, given that they planned to purchase.
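As a worked sketch, using hypothetical counts standing in for the table in Q1.png (the actual values must be read off the image):

```python
# Hypothetical 2x2 counts (NOT the real values from Q1.png):
# rows = planned to purchase (yes/no), columns = placed an order (yes/no).
planned_and_ordered = 400
planned_not_ordered = 100
not_planned_ordered = 200
not_planned_not_ordered = 1300

total = (planned_and_ordered + planned_not_ordered
         + not_planned_ordered + not_planned_not_ordered)

# 1A: joint probability P(planned AND ordered)
p_joint = planned_and_ordered / total

# 1B: conditional probability P(ordered | planned)
p_conditional = planned_and_ordered / (planned_and_ordered + planned_not_ordered)
```

With these stand-in counts, the joint probability is 400/2000 = 0.2 and the conditional probability is 400/500 = 0.8; the same two ratios apply to the real table.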

2. An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions.

2A. What is the probability that none of the items is defective?

2B. What is the probability that exactly one of the items is defective?

2C. What is the probability that two or fewer of the items are defective?

2D. What is the probability that three or more of the items are defective?
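These are binomial probabilities with n = 10 trials and defect probability p = 0.05; a short sketch using scipy.stats:

```python
from scipy.stats import binom

n, p = 10, 0.05  # sample size, historical failure rate

p_none = binom.pmf(0, n, p)               # 2A: P(X = 0)
p_exactly_one = binom.pmf(1, n, p)        # 2B: P(X = 1)
p_two_or_fewer = binom.cdf(2, n, p)       # 2C: P(X <= 2)
p_three_or_more = 1 - binom.cdf(2, n, p)  # 2D: P(X >= 3)
```

These evaluate to approximately 0.5987, 0.3151, 0.9885, and 0.0115 respectively.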

3. A car salesman sells on an average 3 cars per week.

3A. What is the probability that in a given week he will sell some cars?

3B. What is the probability that in a given week he will sell 2 or more but fewer than 5 cars?

3C. Plot the Poisson distribution of cars sold per week vs the number of cars sold per week (plotting both the PMF and the CDF here).
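These are Poisson probabilities with rate λ = 3 cars per week; a sketch of 3A-3C using scipy.stats and matplotlib:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import poisson

lam = 3  # average cars sold per week

# 3A: P(X >= 1) = 1 - P(X = 0)
p_some = 1 - poisson.pmf(0, lam)

# 3B: P(2 <= X < 5) = P(X <= 4) - P(X <= 1)
p_two_to_four = poisson.cdf(4, lam) - poisson.cdf(1, lam)

# 3C: PMF and CDF of cars sold per week
k = np.arange(0, 11)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(k, poisson.pmf(k, lam))
ax1.set(title="PMF", xlabel="Cars sold per week", ylabel="Probability")
ax2.step(k, poisson.cdf(k, lam), where="post")
ax2.set(title="CDF", xlabel="Cars sold per week", ylabel="Cumulative probability")
fig.savefig("poisson_cars.png")
```

This gives approximately 0.9502 for 3A and 0.6161 for 3B.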

4. Accuracy in understanding orders is important for a speech-based bot at a restaurant that Company X has designed, marketed, and launched for contactless delivery during the COVID-19 pandemic. Recognition accuracy, which measures the percentage of orders that are taken correctly, is 86.8%. Suppose that you place an order with the bot and two friends of yours independently place orders with the same bot. Answer the following questions.

4A. What is the probability that all three orders will be recognised correctly?

4B. What is the probability that none of the three orders will be recognised correctly?

4C. What is the probability that at least two of the three orders will be recognised correctly?
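With three independent orders each recognised correctly with probability p = 0.868, these are again binomial probabilities (n = 3); a sketch:

```python
from scipy.stats import binom

n, p = 3, 0.868  # three independent orders, recognition accuracy

p_all_three = binom.pmf(3, n, p)                          # 4A: P(X = 3)
p_none = binom.pmf(0, n, p)                               # 4B: P(X = 0)
p_at_least_two = binom.pmf(2, n, p) + binom.pmf(3, n, p)  # 4C: P(X >= 2)
```

These evaluate to approximately 0.6540, 0.0023, and 0.9523 respectively.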

5. Explain one real-life industry scenario (other than the ones mentioned above) where the concepts learnt in this module of Applied Statistics can be used to arrive at a data-driven business solution.

We can use statistics to evaluate potential new versions of a children’s dry cereal. Taste tests provide valuable statistical information on what customers want from the product. The four key factors that product developers may consider to enhance the taste of the cereal are the following:

  1. Ratio of wheat to corn in the cereal flake
  2. Type of sweetener: sugar, honey, artificial or sugar free
  3. Presence or absence of flavour in the cereal - Fruits, Vegetables, Spices
  4. Cooking time - Short or Long

We should design an experiment to determine what effect these four factors have on cereal taste. For example, one test cereal can be made with a specified ratio of wheat to corn, sugar as the sweetener, flavour bits, and a short cooking time; another test cereal can be made with a different ratio of wheat to corn and the other three factors unchanged; and so on. Groups of children then taste-test the cereals and state what they think about the taste of each.

Analysis of variance (ANOVA) is the statistical method we can use to study the data obtained from the taste tests. The analysis can show which of the four factors, individually or in combination, have a significant effect on taste.

This information is vital for identifying the factors that would lead to the best-tasting cereal. The same information can be used by the marketing and manufacturing teams and can shape a better product development strategy.
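A minimal sketch of such an analysis, using made-up taste scores (the real study would use the children's ratings) and a one-way ANOVA on the sweetener factor via scipy.stats.f_oneway:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Hypothetical taste scores (1-10 scale) for three sweetener groups.
sugar = rng.normal(7.0, 1.0, 30)
honey = rng.normal(7.5, 1.0, 30)
sugar_free = rng.normal(6.0, 1.0, 30)

f_stat, p_value = f_oneway(sugar, honey, sugar_free)
# A p-value below 0.05 suggests that mean taste scores differ across
# sweeteners; the full study would cross all four factors in a
# factorial design rather than testing one factor in isolation.
```

Rejecting the null hypothesis here means at least one sweetener's mean taste score differs from the others; follow-up pairwise comparisons identify which.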

Tools to be used: Python, R, Minitab, Excel, MS SQL Server

Part-B: Solution

Part-B: 30 Marks


Step-1: Read, Clean, and Prepare Dataset to be used for EDA

Import the Relevant Libraries

Some Comments about the Libraries:

Read the Dataset

Shape of the Dataset

Check Information about the Data/Data types of all Attributes

Data Cleaning

Some Insights:

Some Observations:

Final Dataset after Data Cleaning

Step-2: Univariate Analysis

Univariate analysis refers to the analysis of a single variable. Its main purpose is to summarize the variable and find patterns in the data. The key point is that only one variable is involved in the analysis.

Basic Statistics

Histogram for checking the Distribution, Skewness

Box Plot to understand the Distribution
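A sketch of both plots on a hypothetical 'Score' column (the real analysis reads the association's CSV; the gamma-distributed stand-in data is an assumption used only to illustrate skewness):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset's 'Score' column.
df = pd.DataFrame({"Score": np.random.default_rng(1).gamma(2.0, 50.0, 200)})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df["Score"].plot.hist(bins=20, ax=ax1, title="Distribution of Score")
df["Score"].plot.box(ax=ax2, title="Box plot of Score")
fig.savefig("score_univariate.png")

skewness = df["Score"].skew()  # > 0 indicates a right-skewed distribution
```

The histogram shows the shape and skew of the distribution; the box plot exposes the median, quartiles, and any points beyond the whiskers.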

Understand the complete Dataset Distribution

Important Insights:

Explore the 'Score' Variable

Important Insights:

Explore the 'TeamLaunch' Variable

Step-3: Multivariate Analysis

Multivariate analysis is performed to understand the interactions between different fields in the dataset, i.e., between more than two variables.

Examples: pair plots, 3D scatter plots, etc.

Covariance

Correlation

Heatmap
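The covariance/correlation/heatmap steps above can be sketched as follows, on made-up columns that mirror the dataset's structure (PlayedGames, WonGames, Score are taken from the report; the generated values are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
# Hypothetical numeric columns mirroring the dataset's structure.
played = rng.integers(50, 500, 60)
won = (played * rng.uniform(0.2, 0.6, 60)).astype(int)
df = pd.DataFrame({
    "PlayedGames": played,
    "WonGames": won,
    "Score": won * 2 + rng.integers(0, 20, 60),
})

cov = df.cov()    # pairwise covariances (scale-dependent)
corr = df.corr()  # pairwise Pearson correlations (scale-free, in [-1, 1])
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.figure.savefig("correlation_heatmap.png")
```

Correlation is usually preferred over covariance for the heatmap because it is unit-free, so cells are comparable across variable pairs.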

Scatterplot - All Variables

Scatterplot - Selected Variables

Important Insights:

Explore Tournament vs TournamentChampion vs Runnerup

Explore PlayedGames vs WonGames Vs DrawnGames vs LostGames

Explore BasketScored vs BasketGiven

Explore some other Charts

Important Insights:

Step-4: Bivariate Analysis

Through bivariate analysis we try to analyze two variables simultaneously. As opposed to univariate analysis, where we check the characteristics of a single variable, in bivariate analysis we try to determine whether there is any relationship between two variables.

There are essentially 3 major scenarios that we will come across when we perform bivariate analysis:

  1. Both variables of interest are qualitative
  2. One variable is qualitative and the other is quantitative
  3. Both variables are quantitative
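A hedged sketch of all three scenarios on a small made-up frame (the real analysis uses the association's dataset; the column values below are invented for illustration):

```python
import pandas as pd

# Hypothetical data illustrating the three bivariate scenarios.
df = pd.DataFrame({
    "TeamLaunchCategory": ["Old", "New", "Old", "New", "Old", "New"],
    "Result": ["Win", "Loss", "Win", "Win", "Loss", "Loss"],
    "PlayedGames": [300, 40, 280, 60, 310, 50],
    "Score": [600, 70, 550, 110, 590, 80],
})

# 1. Both qualitative: contingency table
ct = pd.crosstab(df["TeamLaunchCategory"], df["Result"])

# 2. Qualitative vs quantitative: group summaries (or grouped box plots)
grouped = df.groupby("TeamLaunchCategory")["PlayedGames"].mean()

# 3. Both quantitative: correlation (or a scatter plot)
r = df["PlayedGames"].corr(df["Score"])
```

The same three tools (cross-tabulations, grouped summaries, correlations/scatter plots) cover each of the Team/PlayedGames/Score explorations listed below.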

Explore Team vs PlayedGames with TeamLaunch

Explore PlayedGames across TeamLaunchCategory

Explore Tournament vs Team

Explore Score Vs Teams

Explore Score Vs Teams with TeamLaunchCategory

Step-5: Performance Matrix

Correlation Matrix

Finding Outliers using Z-Score
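A minimal sketch of the Z-score method, on made-up values with one planted extreme point (the real analysis applies this to the dataset's numeric columns):

```python
import numpy as np
from scipy import stats

# Hypothetical values with one extreme point appended.
values = np.append(np.random.default_rng(3).normal(100, 15, 99), 400.0)

z = np.abs(stats.zscore(values))
outliers = values[z > 3]  # common |z| > 3 rule of thumb
```

Points whose standardized distance from the mean exceeds 3 are flagged; the threshold is a convention, and a stricter or looser cutoff can be justified by the domain.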

Important Insights:

Explore Win% Vs Drawn% Vs Lost%

Explore PlayedGames vs Teams vs Team Launch

Performance Report of Teams in Playing Games

Analyze Team Launch Categories

Important Insights:

Top 10 teams with the highest Win %

Top 10 winning teams excluding very old teams

Top Teams with High Performance

Best Performance Team in the Dataset

Teams with Low Performance

Teams holding high rank positions

Old teams with low performance

Team with Most Drawn games

Old teams with low targets

Step-7: Improvements and suggestions to the association management on the quality, quantity, variety, velocity, veracity, etc. of the data points collected by the association, to enable better data analysis in future.

Please find below suggestions for data point collection and other relevant guidelines:

  1. Volume: We can add more teams for a better understanding of the data. To increase the predictive power of the dataset, we can add other information such as player information, demographics, tournament locations, etc. The database systems can move from traditional to more advanced big data systems.

  2. Velocity: Considering the time value of the data, it appears outdated and of little use, particularly if the Big Data project is to serve any real-time or near-real-time business needs. In such a context we should redefine the data quality metrics so that they are both relevant and feasible in a real-time setting.

  3. Variety: For better insights and for modeling projects in AI and ML, we can add other data types (structured, semi-structured, and unstructured) coming in from different data sources relevant to basketball.

  4. Veracity: We have incomplete team information. For example, Team 61 has no information about Score, PlayedGames, etc., yet its HighestPositionHeld is 1. The accuracy of data collection should be improved. Besides data inaccuracies, veracity also includes data consistency (defined by the statistical reliability of the data) and data trustworthiness (based on data origin, data collection and processing methods, security infrastructure, etc.). These data quality issues in turn impact data integrity and data accountability.

  5. Value: The Value characteristic connects directly to the end purpose and the business use cases. We can harness the power of Big Data for many diverse business pursuits, and those pursuits are the real drivers of how data quality is defined, measured, and improved. Data Science is already playing a pivotal role in sports analytics.

  6. Based on a strong understanding of the business use cases and the Big Data architecture, we can design and implement an optimal layer of data governance strategy to further improve the data quality with data definitions, metadata requirements, data ownership, data flow diagrams, etc.

  7. We can add more attributes to the dataset. More relevant attributes, such as Canceled Games, Basket Ratio, Winning Ratio, and Win/Loss Percentage, will help us analyze teams more accurately.

  8. Even simple ML models can add more value to the present EDA use case.

  9. Big data and data science together allow us to see both the forest and the trees (macro and micro perspectives).

  10. Visualization, dashboarding, and interactivity make the data more useful to the general public. We can build an API around the analysis and deploy it on the cloud to serve this purpose.

References:

  1. Towards Data Science. Sports Analytics
  2. Kaggle. Kaggle Code
  3. KdNuggets
  4. AnalyticsVidhya
  5. Wikipedia. Basketball
  6. Wikipedia. Sports Analytics
  7. Wikipedia. National Basketball Association
  8. Zuccolotto, Manisera, & Sandri. "Basketball Data Science: With Applications in R." Chapman & Hall/CRC Data Science Series, 2020. Print.
  9. Baker & Shea. "Basketball Analytics: Objective and Efficient Strategies for Understanding How Teams Win." 2013. Print.
  10. Shea. "Basketball Analytics: Spatial Tracking." Kindle.
  11. Oliver & Alamar. "Sports Analytics: A Guide for Coaches, Managers, and Other Decision Makers." 2013. Print.
  12. Numpy
  13. Pandas
  14. SciPy
  15. MatplotLib
  16. Seaborn
  17. Python
  18. Plotly
  19. Bokeh
  20. RStudio
  21. Minitab
  22. Anaconda

Part-C: Solution

Part C - 15 Marks

DOMAIN: Startup ecosystem

CONTEXT: Company X is an EU online publisher focusing on the startup industry. The company specifically reports on the business of technology news, analysis of emerging trends, and profiling of new tech businesses and products. Their event, Startup Battlefield, is the world’s pre-eminent startup competition. Startup Battlefield features 15-30 top early-stage startups pitching to top judges in front of a vast live audience, present in person and online.

DATA DESCRIPTION: CompanyX_EU.csv - Each row in the dataset is a Start-up company and the columns describe the company.

DATA DICTIONARY:

  1. Startup: Name of the company
  2. Product: Actual product
  3. Funding: Funds raised by the company in USD
  4. Event: The event the company participated in
  5. Result: Described by Contestant, Finalist, Audience choice, Winner, or Runner up
  6. OperatingState: Current status of the company: Operating, Closed, Acquired, or IPO

*Dataset has been downloaded from the internet. All the credit for the dataset goes to the original creator of the data.

PROJECT OBJECTIVE: Analyse the data of the various companies from the given dataset and perform the tasks that are specified in the below steps. Draw insights from the various attributes that are present in the dataset, plot distributions, state hypotheses and draw conclusions from the dataset.

1. Read the CSV file

2. Data Exploration

2A. Check the datatypes of each attribute.

2B. Check for null values in the attributes.

3. Data preprocessing & visualisation:

3A. Drop the null values.

3B. Convert the ‘Funding’ feature to a numerical value.

3C. Plot box plot for funds in million.

3D. Check the number of outliers greater than the upper fence.

3E. Check the frequency of the OperatingState feature’s classes.
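Steps 3A-3E can be sketched on a tiny hypothetical frame. The column names follow the data dictionary; the "$1.2M"-style funding strings and the K/M/B suffix convention are assumptions about the raw file:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import pandas as pd

# Hypothetical rows standing in for CompanyX_EU.csv, which would be
# loaded with pd.read_csv("CompanyX_EU.csv").
df = pd.DataFrame({
    "Startup": ["A", "B", "C", "D"],
    "Funding": ["$1.2M", "$300K", "$2B", None],
    "OperatingState": ["Operating", "Closed", "Operating", "Acquired"],
})

df = df.dropna()  # 3A: drop rows with null values

# 3B: convert "$1.2M"-style strings to USD millions (assumed format)
multiplier = {"K": 1e-3, "M": 1.0, "B": 1e3}
df["FundsInMillion"] = (df["Funding"].str[1:-1].astype(float)
                        * df["Funding"].str[-1].map(multiplier))

# 3C: box plot of funds in millions
ax = df["FundsInMillion"].plot.box(title="Funds (USD millions)")
ax.figure.savefig("funds_boxplot.png")

# 3D: count outliers above the upper fence (Q3 + 1.5 * IQR)
q1, q3 = df["FundsInMillion"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
n_outliers = int((df["FundsInMillion"] > upper_fence).sum())

# 3E: frequency of the OperatingState classes
state_counts = df["OperatingState"].value_counts()
```

On the real file, the parsing step should be adapted to whatever format the Funding column actually uses before computing the fences.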

4. Statistical Analysis

4A. Is there any significant difference between Funds raised by companies that are still operating vs companies that closed down?

Important Insights:

4B. Write the null hypothesis and alternative hypothesis.

The two hypotheses for this particular two sample t-test are as follows:

4C. Test for significance and conclusion

Assumptions:

Note: The two sample t-test is relatively robust to violations of the assumptions of normality and homogeneity of variances when the sample sizes are large (n ≥ 30) and the two groups have equal sample sizes (n1 = n2).

If the sample size is small and the data do not follow a normal distribution, we should use the non-parametric Mann-Whitney U test (Wilcoxon rank-sum test).

Conduct a two sample t-test:

Next, we’ll use the ttest_ind() function from the scipy.stats library to conduct a two sample t-test, which uses the following syntax:

ttest_ind(a, b, equal_var=True)

where a and b are the arrays of sample observations for the two groups, and equal_var indicates whether to assume equal population variances (equal_var=False performs Welch’s t-test instead).

Before we perform the test, we need to decide if we’ll assume the two populations have equal variances or not. As a rule of thumb, we can assume the populations have equal variances if the ratio of the larger sample variance to the smaller sample variance is less than 4:1.
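A hedged sketch of the whole procedure with made-up funding figures (the real analysis uses the funds-in-million values split by OperatingState):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)
# Hypothetical funding amounts (USD millions) for the two groups.
operating = rng.lognormal(2.0, 1.0, 40)
closed = rng.lognormal(1.0, 1.0, 40)

# Variance-ratio rule of thumb: assume equal variances only if the
# larger sample variance is less than 4x the smaller one.
ratio = (max(operating.var(ddof=1), closed.var(ddof=1))
         / min(operating.var(ddof=1), closed.var(ddof=1)))

t_stat, p_value = ttest_ind(operating, closed, equal_var=(ratio < 4))
```

We reject the null hypothesis of equal means at alpha = 0.05 when p_value < 0.05.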

Interpretation:

  1. Because the p-value of our test (0.00789) is less than alpha = 0.05, we reject the null hypothesis of the test.

  2. We do have sufficient evidence to say that there is a significant difference between the funds raised by the companies that are operating vs the companies that are closed.

4D. Make a copy of the original data frame.

4E. Check frequency distribution of Result variables.

4F. Calculate the percentage of winners that are still operating and the percentage of contestants that are still operating.

4G. Write your hypothesis comparing the proportion of companies that are operating between winners and contestants:

The two hypotheses for this particular two sample z-test are as follows:

4H. Test for significance and conclusion

Interpretation:

  1. Because the p-value of our test (0.000) is less than alpha = 0.05, we reject the null hypothesis of the test.

  2. We do have sufficient evidence to say that there is a significant difference between the proportions of operating companies among winners and contestants.
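The 4H test can be sketched with statsmodels' two-proportion z-test, using hypothetical counts (the real counts come from the Result and OperatingState columns):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: companies still operating out of each group's total.
still_operating = [55, 300]  # winners, contestants
group_totals = [60, 500]

z_stat, p_value = proportions_ztest(count=still_operating, nobs=group_totals)
# p_value < 0.05 -> reject H0 of equal operating proportions
```

With these stand-in counts (91.7% vs 60.0% operating), the difference is clearly significant; the actual p-value reported above comes from the dataset's real counts.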

4I. Select only the Event that has ‘disrupt’ keyword from 2013 onwards.
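One way to sketch 4I in pandas; the "Disrupt NY 2013"-style event strings and the trailing four-digit year are assumptions about the data:

```python
import pandas as pd

# Hypothetical Event values; the real ones come from CompanyX_EU.csv.
df = pd.DataFrame({"Event": ["Disrupt NY 2013", "Disrupt SF 2012",
                             "Battlefield 2014", "Disrupt EU 2015"]})

# Keep events containing 'disrupt' (case-insensitive) whose year is
# 2013 or later.
has_disrupt = df["Event"].str.contains("disrupt", case=False, na=False)
year = df["Event"].str.extract(r"(\d{4})")[0].astype(float)
disrupt_2013_on = df[has_disrupt & (year >= 2013)]
```

If some real Event values lack a year, the extracted year will be NaN and those rows drop out of the comparison automatically.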